This guide walks you through setting up your own Large Language Model (LLM) server using Ollama on an Ubuntu VM with NVIDIA GPU passthrough in Proxmox.

What is This Setup?

This configuration allows you to:
  • Run LLMs locally on your own hardware with GPU acceleration
  • Host models like Llama, Mistral, or GPT-OSS for private AI inference
  • Achieve faster response times compared to CPU-only inference
  • Maintain data privacy by keeping everything on your infrastructure
  • Access your LLM via web UI similar to ChatGPT
Unlike cloud-based LLM services, this setup gives you complete control over your models, data, and costs.

What is GPU Passthrough?

GPU passthrough (also called PCIe passthrough) allows a virtual machine to directly access a physical GPU, bypassing the hypervisor layer. This means:
  • Near-native performance: Your VM gets almost the same GPU performance as bare metal
  • Direct hardware access: The VM controls the GPU as if it were physically installed
  • Exclusive access: Only one VM can use the passed-through GPU at a time
  • Required for GPU compute: Essential for running LLMs with GPU acceleration in VMs
Without GPU passthrough, your LLM would run on CPU only, which is 10-100x slower than GPU-accelerated inference.
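Before going further, it helps to confirm that the Proxmox host can actually do passthrough. A quick check, run on the Proxmox host shell (this assumes VT-d/AMD-Vi, i.e. IOMMU, is enabled in the BIOS):
# On the Proxmox host: confirm the kernel detected an IOMMU
dmesg | grep -e DMAR -e IOMMU

# List IOMMU groups; the GPU should show up in one of them
find /sys/kernel/iommu_groups/ -type l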

Prerequisites

Before starting, you need:
  1. A Proxmox server with an NVIDIA GPU installed
  2. An Ubuntu Server VM (22.04 or later recommended)
  3. Docker installed on the VM
  4. Basic familiarity with Linux command line
  5. SSH access to your VM
  6. Sufficient VRAM on your GPU
GPU recommendations:
  • Small models (7B parameters): 8GB VRAM minimum
  • Medium models (13B-20B): 16GB+ VRAM
  • Large models (30B+): 24GB+ VRAM
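As a rough rule of thumb, a 4-bit quantized model needs a little over half a gigabyte of VRAM per billion parameters for its weights (a 7B model is roughly 4-5 GB), plus headroom for the context cache and CUDA overhead, which is why 8GB is a comfortable floor for 7B models. Once the driver from Step 2 is installed, you can check your card's total VRAM directly:
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader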

Step 1: Configure GPU Passthrough in Proxmox

Follow this video tutorial to set up GPU passthrough from your Proxmox host to your Ubuntu VM: 📹 Proxmox GPU Passthrough Guide. The video covers:
  • Enabling IOMMU in BIOS
  • Configuring Proxmox for PCIe passthrough
  • Adding the GPU to your VM
  • Verifying the setup
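If you prefer text to video, the host-side work typically looks like the sketch below (Intel CPU shown; AMD hosts use amd_iommu=on, and exact steps can vary with your Proxmox and kernel version, so treat the video as the authoritative reference). Run this on the Proxmox host:
# 1. In /etc/default/grub, enable IOMMU on the kernel command line, e.g.:
#    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# 2. Apply it and load the VFIO modules at boot:
update-grub
printf "vfio\nvfio_iommu_type1\nvfio_pci\n" >> /etc/modules
update-initramfs -u -k all
reboot
# 3. Then add the GPU to the VM as a PCI device (Hardware -> Add -> PCI Device in the Proxmox UI)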
After completing the setup, verify that the GPU is visible in your VM:
lspci | grep -i nvidia
Expected output:
01:00.0 VGA compatible controller: NVIDIA Corporation ...
01:00.1 Audio device: NVIDIA Corporation ...
If you see NVIDIA devices listed, passthrough is working correctly.

Step 2: Install NVIDIA Drivers

The NVIDIA drivers enable your Ubuntu system to communicate with the GPU hardware. Follow this guide for driver installation on Ubuntu: 📖 NVIDIA Driver Installation Guide. Quick verification after installation:
nvidia-smi
Expected output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   35C    P8    15W / 250W |      0MiB / 16384MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
This confirms your GPU is detected and the driver is working.
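For scripts or monitoring, nvidia-smi can also report just the fields you care about (handy later for checking VRAM headroom while models are loaded):
nvidia-smi --query-gpu=name,driver_version,memory.total,memory.used,utilization.gpu --format=csv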

Step 3: Install CUDA Toolkit

CUDA is NVIDIA’s parallel computing platform, required for GPU-accelerated applications. Download and install CUDA from the official source: 📦 CUDA Toolkit Downloads. Select your operating system, architecture, and distribution to get the appropriate installation commands.
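Once it is installed, a quick way to confirm the toolkit is usable (note that nvcc may not be on your PATH until /usr/local/cuda/bin is added, as the installer's post-install notes describe):
nvcc --version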

Step 4: NVIDIA Container Toolkit

Install the NVIDIA Container Toolkit by following the official installation guide, which enables GPU access inside Docker containers: 📖 NVIDIA Container Toolkit Installation. Verify GPU access in Docker:
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
You should see the same nvidia-smi output as before, confirming Docker can access the GPU.
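If that test instead fails with an error about not being able to select a device driver with gpu capabilities, Docker's runtime has usually not been configured for the toolkit yet; assuming the toolkit package itself installed cleanly, the usual fix is:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker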

Step 5: Deploy Ollama and Open WebUI

Create a directory for your setup:
mkdir -p ~/llm-server
cd ~/llm-server
Create docker-compose.yml:
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: always
    environment:
      OLLAMA_KEEP_ALIVE: -1
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: always
    environment:
      ENABLE_ADMIN_CHAT_ACCESS: "false"
    ports:
      - "80:8080"
    volumes:
      - open-webui:/app/backend/data
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  ollama:
  open-webui:
What these services do:

Ollama Service

  • OLLAMA_KEEP_ALIVE: -1: Keeps models loaded in GPU memory indefinitely for instant responses
  • Port 11434: API endpoint for model inference (you can hit it directly with curl; see the example after this list)
  • Volume: Persists downloaded models between restarts
  • GPU reservation: Ensures the container can access all available GPUs
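Because port 11434 is published, you can also talk to Ollama's HTTP API directly from the VM, which is useful for scripting or for wiring in other tools. A couple of quick examples (llama3.2:7b is the model pulled later in this guide, so run these after Step 6):
# List the models Ollama has downloaded
curl http://localhost:11434/api/tags

# Run a one-off prompt without the web UI
curl http://localhost:11434/api/generate -d '{"model": "llama3.2:7b", "prompt": "Hello", "stream": false}'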

Open WebUI Service

  • Port 80: Web interface accessible at http://your-vm-ip
  • ENABLE_ADMIN_CHAT_ACCESS: false: Prevents the admin account from reading other users' chats (it's a bit creepy to read your employees' conversations, after all)
  • host.docker.internal: Allows the web UI to communicate with Ollama
  • Volume: Stores user data, conversations, and settings
Start the services:
docker compose up -d
Verify both containers are running:
docker ps
Expected output:
CONTAINER ID   IMAGE                              STATUS         PORTS
abc123def456   ollama/ollama                      Up 2 minutes   0.0.0.0:11434->11434/tcp
def456abc789   ghcr.io/open-webui/open-webui:main Up 2 minutes   0.0.0.0:80->8080/tcp
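If either container is missing from that list or keeps restarting, the logs usually say why:
docker compose logs ollama
docker compose logs open-webui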

Step 6: Access Open WebUI and Download Models

Open your web browser and navigate to:
http://your-vm-ip
You’ll see the Open WebUI interface. On first access, you’ll need to create an admin account. Download your first model:
  1. Click on your profile icon in the top right
  2. Go to Admin Panel → Settings → Models
  3. In the “Pull a model from Ollama.com” field, enter a model name
  4. Click the download button
Or download via command line:
docker exec -it ollama ollama pull llama3.2:7b
The model will appear in Open WebUI’s model selector once downloaded.
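You can also confirm from the command line which models have been pulled so far:
docker exec -it ollama ollama list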

Step 7: Verify GPU Acceleration

Check that your model is running on the GPU:
docker exec -it ollama nvidia-smi
Expected output showing GPU memory usage:
+-----------------------------------------------------------------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 45%   65C    P2   180W / 250W |   7234MiB / 16384MiB |     95%      Default |
+-----------------------------------------------------------------------------+
Check loaded models:
docker exec -it ollama ollama ps
Expected output:
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
llama3.2:7b     a80c4f17acd5    4.7 GB   100% GPU     4096       Forever
Critical indicators:
  • PROCESSOR: 100% GPU ✅ - Model is running on GPU (good!)
  • PROCESSOR: 100% CPU ❌ - Model fell back to CPU
  • UNTIL: Forever ✅ - Model stays loaded (due to OLLAMA_KEEP_ALIVE: -1)
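A practical way to confirm GPU use is to watch the card while a prompt is generating; utilization and memory usage should jump while the model responds. In a second SSH session:
watch -n 1 nvidia-smi
If you ever see 100% CPU instead, the model plus its context most likely did not fit in VRAM; try a smaller model or a lower context length (see Step 8).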

Step 8: Customize Model Context Length

The context window determines how much text the model can remember in a conversation. Larger contexts allow for longer discussions but use more VRAM. Access Ollama’s interactive mode:
docker exec -it ollama ollama run llama3.2:7b
Set a custom context length:
>>> /set parameter num_ctx 10000
Set parameter 'num_ctx' to '10000'
⚠️ Important: These changes are temporary and are lost when the model unloads! To make context changes permanent:
>>> /set parameter num_ctx 10000
Set parameter 'num_ctx' to '10000'
>>> /save llama3.2:7b-10k
Created new model 'llama3.2:7b-10k'
>>> /bye
What this does:
  1. Sets context to 10,000 tokens
  2. Saves as a new model variant with the custom context
  3. The new model persists these settings permanently
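An equivalent, non-interactive way to bake in the context length is an Ollama Modelfile; a minimal sketch (the file path and tag below are just examples):
# Write a Modelfile that derives a new tag from the base model
cat > Modelfile <<'EOF'
FROM llama3.2:7b
PARAMETER num_ctx 10000
EOF

# Copy it into the container and build the variant
docker cp Modelfile ollama:/tmp/Modelfile
docker exec -it ollama ollama create llama3.2:7b-10k -f /tmp/Modelfile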
⚠️ VRAM Warning: Setting context too high can exhaust your GPU memory, causing the model to fall back to CPU (much slower). Always monitor VRAM usage with nvidia-smi after changing context length. Verify your custom model:
docker exec -it ollama ollama ps
Expected output:
NAME               ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
llama3.2:7b-10k    a80c4f17acd5    4.7 GB   100% GPU     10000      Forever
Your custom model will now appear in Open WebUI’s model selector. 📖 If you run into problems, check out the Official Ollama Troubleshooting Guide.